11.2 - Advanced Concepts to Explore - HRL, MARL
The monolithic, single-agent architectures we have implemented are powerful but have inherent limitations in solving problems that require both long-term strategy and fine-grained, parallel control. To address this, the field of reinforcement learning offers more advanced architectural paradigms.
This section provides a high-level overview of two of the most important of these: Hierarchical Reinforcement Learning (HRL) and Multi-Agent Reinforcement Learning (MARL).
Important Disclaimer - Beyond stable-baselines3
The standard stable-baselines3 library is a single-agent framework. It is not designed to handle the multi-policy systems required for HRL or the decentralized nature of MARL out of the box. Implementing these paradigms requires specialized libraries (such as ray[rllib]) or significant custom development. The purpose of this section is to introduce the concepts so you can recognize when they are the appropriate next step in your research.
Hierarchical Reinforcement Learning (HRL)
HRL addresses the challenge of long-term credit assignment by decomposing a single, complex policy into a hierarchy of policies, each operating at a different level of temporal abstraction.
The Architectural Pattern - Manager and Worker
- Manager (High-Level Policy): Observes the global game state and selects a high-level goal (e.g., "expand now," "attack enemy base"). It operates on a slow timescale.
- Worker (Low-Level Policy): Receives the current state and the manager's current goal. Its job is to output the low-level actions (move, attack, build) necessary to achieve that specific goal.
Information Flow:
+--------------------------+
|       Global State       |
+------------+-------------+
             |
             v
+--------------------------+  (Selects Goal)  +----------------+
|      Manager Policy      |----------------->|      Goal      |
|   (Outputs a sub-goal)   |                  |("Attack Base") |
+--------------------------+                  +-------+--------+
                                                      |
                                                      v
+--------------------------------------------------------------+
|                        Worker Policy                         |
|      (Receives State + Goal, outputs low-level actions)      |
+--------------------------------------------------------------+
| HRL Solves | HRL Introduces a Challenge |
| --- | --- |
| Long-Term Planning: The manager only needs to learn high-level strategy, a much simpler problem than micro-managing every unit. | Goal and Reward Definition: Defining a useful set of abstract sub-goals and designing the reward functions to train the worker policy are non-trivial engineering tasks. |
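To make the worker side of this pattern concrete, here is a minimal sketch built on the generic gymnasium wrapper API. The goal names, the one-hot goal encoding, and the idea of a manager calling set_goal() every N worker steps are illustrative assumptions for this sketch, not part of any SC2 or stable-baselines3 interface.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class GoalConditionedWorkerEnv(gym.Wrapper):
    """Exposes the low-level environment to the worker policy with the
    manager's current sub-goal appended to the observation as a one-hot
    vector. Assumes a flat Box observation space; goal names and reward
    handling are placeholders."""

    GOALS = ["expand", "attack_base", "defend"]  # hypothetical sub-goals

    def __init__(self, env):
        super().__init__(env)
        obs_dim = int(np.prod(env.observation_space.shape))
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(obs_dim + len(self.GOALS),), dtype=np.float32,
        )
        self.current_goal = 0

    def set_goal(self, goal_index):
        # The manager policy would call this every N worker steps.
        self.current_goal = goal_index

    def _augment(self, obs):
        one_hot = np.zeros(len(self.GOALS), dtype=np.float32)
        one_hot[self.current_goal] = 1.0
        flat = np.asarray(obs, dtype=np.float32).ravel()
        return np.concatenate([flat, one_hot])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # A full implementation would replace `reward` here with a
        # goal-specific shaped reward (e.g., progress toward "expand").
        return self._augment(obs), reward, terminated, truncated, info
```

In a full HRL setup, the manager would pick a goal index on its slow timescale (its own "action"), call set_goal(), and be rewarded at that coarser granularity, while the worker trains against the goal-conditioned observations and a goal-specific shaped reward.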
Multi-Agent Reinforcement Learning (MARL)
MARL addresses problems requiring decentralized, parallel control by modeling each entity as its own independent, learning agent.
The Architectural Pattern - A Team of Agents
- Each unit (or squad) is an individual agent with its own policy network.
- Each agent receives a local observation of the environment.
- All agents act concurrently, and effective coordinated behavior emerges from their collective actions.
Information Flow:
                     +-------------------+
                     |    Shared SC2     |
                     |    Environment    |
                     +------+-----+------+
     (obs_1, rew_1)         ^     ^         (obs_2, rew_2)
            |               |     |               |
            v               |     |               v
+-----------+------------+  |     |   +-----------+------------+
|   Agent 1 (Policy 1)   |  |     |   |   Agent 2 (Policy 2)   |
|   (Outputs action_1)   |  |     |   |   (Outputs action_2)   |
+-----------+------------+  |     |   +-----------+------------+
            |               |     |               |
            +-------------->+     +<--------------+
                action_1               action_2
| MARL Solves | MARL Introduces a Challenge |
| --- | --- |
| Fine-Grained Micro: It is a natural fit for complex squad-level combat, allowing for highly reactive and coordinated control. | Non-Stationarity: From the perspective of any one agent, the environment is constantly changing as its teammates' policies evolve, which can make training unstable. This often requires specialized algorithms (e.g., centralized training with decentralized execution). |
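The structural idea can be sketched without any particular framework: one policy per agent, each mapping only its own local observation to an action, with all actions gathered into a single joint step. The agent names, the 8-dimensional observations, and the fixed linear "policies" below are pure placeholders standing in for separately trained networks; nothing here is a real SC2 interface.

```python
import numpy as np

# Hypothetical agent ids; in practice these come from the environment.
AGENT_IDS = ["marine_1", "marine_2", "medivac_1"]

rng = np.random.default_rng(0)

# One independent "policy" per agent: a fixed random linear map from an
# 8-dim local observation to 4 action logits, standing in for a network.
policies = {aid: rng.normal(size=(4, 8)) for aid in AGENT_IDS}


def local_observation(agent_id, global_state):
    # A real implementation would crop/encode the map around this unit;
    # here the "global state" is just a dict of per-agent feature vectors.
    return global_state[agent_id]


def select_action(policy, obs):
    # Greedy action from the agent's own policy and its own observation.
    return int(np.argmax(policy @ obs))


# One decision step: every agent acts concurrently on local information.
global_state = {aid: rng.normal(size=8) for aid in AGENT_IDS}
joint_action = {
    aid: select_action(policies[aid], local_observation(aid, global_state))
    for aid in AGENT_IDS
}
print(joint_action)  # e.g. {'marine_1': 2, 'marine_2': 0, 'medivac_1': 3}
```

Training each of these policies independently is exactly where the non-stationarity problem in the table above comes from: every agent's "environment" includes its teammates, whose behavior keeps changing as they learn.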
Where to Go Next
To implement these advanced concepts, you will need to explore frameworks designed for them:
- For HRL and MARL: Ray RLlib is a powerful, industry-standard library that supports a wide variety of advanced RL paradigms.
- For MARL: PettingZoo is a popular library that provides a gymnasium-like API specifically for multi-agent environments.
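To get a feel for that API, here is a minimal random-agent loop over PettingZoo's parallel interface, using the bundled Pistonball environment purely as a stand-in (a custom multi-agent SC2-style environment would expose the same methods). Note that the exact return signature of reset() has varied across PettingZoo versions; the two-value form below matches recent releases.

```python
# A minimal random-agent loop over PettingZoo's parallel API.
# Pistonball is just a stand-in environment; install with:
#   pip install "pettingzoo[butterfly]"
from pettingzoo.butterfly import pistonball_v6

env = pistonball_v6.parallel_env()
observations, infos = env.reset(seed=42)

while env.agents:
    # Every live agent chooses an action from its own observation;
    # here we sample randomly instead of calling a trained policy.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```

Swapping the random sampling for per-agent policies (as in the MARL sketch above), or handing the whole environment to RLlib's multi-agent training facilities, is the natural next step.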